Scaling Spark PG DB Beyond 500 GiB

Ideas:

Constraints:

  • If we run our DB with a cloud provider other than Fly.io, we have to pay for bandwidth between the Spark services running on Fly.io and the DB server.
  • We could migrate the Spark services too, but that’s even more effort.

Migrate our DB to a hosted platform

  • AWS RDS Aurora
    • Estimated cost at current usage: $316/month
  • CockroachDB is too expensive

Use web3.storage for historical measurements

👉🏻 Pull request: https://github.com/filecoin-station/spark-api/pull/182

Roll up published_as CIDs into a new table and remove the published measurements from the measurements table (keep only measurements from the last X hours).
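
A minimal sketch of what the roll-up table could look like, using node-postgres; the table name (commitments) and the column types are assumptions, not a final schema:

```ts
// Sketch of a migration creating the roll-up table. One row per published
// commitment; adjust names/types to the real schema.
import pg from 'pg'

const client = new pg.Client({ connectionString: process.env.DATABASE_URL })
await client.connect()

await client.query(`
  CREATE TABLE IF NOT EXISTS commitments (
    contract_address TEXT NOT NULL,
    round_index BIGINT NOT NULL,
    cid TEXT NOT NULL,
    PRIMARY KEY (contract_address, round_index, cid)
  )
`)

await client.end()
```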

Downsides:

  • Before we can run any more analysis on the SQL data, we need to build a tool to fetch measurements from web3.storage (see the sketch after this list).
  • The data migration is going to be very CPU-intensive and will take a long time to complete. We need to run it manually, outside of the usual deployment/auto-migration process.
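
A rough sketch of such a fetch tool, assuming the published batches are plain JSON retrievable through web3.storage’s public IPFS gateway (w3s.link); the actual encoding may differ:

```ts
// Download a published measurements batch by its CID via the public
// web3.storage gateway. Assumes the batch is stored as a JSON array.
export async function fetchMeasurements (cid: string): Promise<unknown[]> {
  const res = await fetch(`https://w3s.link/ipfs/${cid}`)
  if (!res.ok) {
    throw new Error(`Cannot fetch measurements ${cid}: ${res.status} ${res.statusText}`)
  }
  return await res.json()
}

// Example usage: const measurements = await fetchMeasurements('bafy…')
```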

Algorithm:

  • In spark-publish, after we publish the CID to the smart contract, insert a new row (contract_address, round_index, cid).
  • Periodically, delete all measurements that have the published_as field set and are older than X hours (see the sketch after this list).
  • Initial migration:

    For historical measurements where we don’t know the exact round index they were published in (the round index can differ from what spark-api filled in the measurements table), assign their published_as CID to a “virtual” contract and round index, e.g. contract_address=0xUNKNOWN and round_index=0.
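
A sketch of the first two steps (referenced from the list above), assuming a roll-up table like the one sketched earlier and a measurements table with a finished_at timestamp column; the column name is an assumption:

```ts
import pg from 'pg'

const client = new pg.Client({ connectionString: process.env.DATABASE_URL })
await client.connect()

// Step 1: in spark-publish, record the commitment right after the CID has
// been published to the smart contract.
export async function recordCommitment (
  contractAddress: string,
  roundIndex: number,
  cid: string
): Promise<void> {
  await client.query(
    'INSERT INTO commitments (contract_address, round_index, cid) VALUES ($1, $2, $3)',
    [contractAddress, roundIndex, cid]
  )
}

// Step 2: run periodically, e.g. from a scheduled task. Deletes measurements
// that were already published and fall outside the retention window.
// "finished_at" stands in for whichever timestamp column measurements have.
export async function deletePublishedMeasurements (retentionHours: number): Promise<void> {
  await client.query(
    `DELETE FROM measurements
     WHERE published_as IS NOT NULL
       AND finished_at < now() - $1::interval`,
    [`${retentionHours} hours`]
  )
}
```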

Improvement:

  • To make it easier to filter CIDs based on date ranges, we can add a timestamp field to each (contract_address, round_index, cid) row.
  • For historical measurements, we can set this value to the latest measurement timestamp + INTERVAL '3 minutes', aggregated across all measurements published in the same batch (i.e. sharing the same published_as CID); a backfill sketch follows this list.
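
A hedged sketch of that backfill; the published_at column on the roll-up table and the finished_at column on measurements are assumptions, not the real schema:

```ts
import pg from 'pg'

const client = new pg.Client({ connectionString: process.env.DATABASE_URL })
await client.connect()

// Add a timestamp to each (contract_address, round_index, cid) row.
await client.query(
  'ALTER TABLE commitments ADD COLUMN IF NOT EXISTS published_at TIMESTAMPTZ'
)

// Backfill historical rows: newest measurement timestamp in each published
// batch, padded by the ~3 minute publishing cadence.
await client.query(`
  UPDATE commitments c
  SET published_at = m.latest + INTERVAL '3 minutes'
  FROM (
    SELECT published_as, MAX(finished_at) AS latest
    FROM measurements
    GROUP BY published_as
  ) m
  WHERE c.cid = m.published_as
`)

await client.end()
```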

Expected savings:

  • If each commitment contains ~50k measurements of ~600 bytes each, we will add ~100 bytes per commitment and delete ~30 MB. We are publishing a new commitment every ~3 minutes, i.e. ~480 commitments per day, which works out to ~14.4 GB deleted per day.
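
The arithmetic behind that estimate, spelled out:

```ts
// Back-of-the-envelope numbers from the estimate above.
const measurementsPerCommitment = 50_000
const bytesPerMeasurement = 600
const commitmentsPerDay = (24 * 60) / 3 // one commitment every ~3 minutes = 480

const deletedPerCommitment = measurementsPerCommitment * bytesPerMeasurement // 30 MB
const deletedPerDay = deletedPerCommitment * commitmentsPerDay // ~14.4 GB
const addedPerDay = 100 * commitmentsPerDay // ~48 kB of new roll-up rows

console.log({ deletedPerCommitment, deletedPerDay, addedPerDay })
```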

Compress data

https://medium.com/@lk.snatch/postgresql-compression-854a4647ee43

  • Postgres Pro
  • Compressed file system (e.g. ZFS with transparent compression)